Abstract. We consider several novel aspects of unique factorization in
formal languages. We reprove the familiar fact that the set uf(L) of words
having unique factorization into elements of L is regular if L is regular,
and from this deduce a quadratic upper and lower bound on the length
of the shortest word not in uf(L). We observe that uf(L) need not be
context-free if L is context-free.

Next, we consider variations on unique factorization. We define a notion 
of “semi-unique” factorization, where every factorization has the same 
number of terms, and show that, if L is regular or even finite, the set 
of words having such a factorization need not be context-free. Finally, 
we consider additional variations, such as unique factorization “up to 
permutation” and “up to subset”. 


1 Introduction 

Let L be a formal language. We say $x \in L^*$ has unique factorization if whenever

$$x = y_1 y_2 \cdots y_m = z_1 z_2 \cdots z_n$$

for $y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_n \in L$, then $m = n$ and $y_i = z_i$ for $1 \le i \le m$. If
every element of $L^*$ has unique factorization into elements of L, then L is called
a code.
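
As an illustration of this definition (not part of the original text), the number of factorizations of a word over a finite L can be counted by dynamic programming over prefixes. The sketch below assumes L is a finite set of nonempty strings; the names `count_factorizations` and `has_unique_factorization` are ours.

```python
def count_factorizations(x, L):
    """Number of ways to write x as a concatenation of words of the finite set L."""
    ways = [0] * (len(x) + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) factorization
    for end in range(1, len(x) + 1):
        for w in L:
            # w must be nonempty and must be a suffix of the prefix x[:end]
            if w and len(w) <= end and x[end - len(w):end] == w:
                ways[end] += ways[end - len(w)]
    return ways[-1]


def has_unique_factorization(x, L):
    """True when x is in L* and has exactly one factorization into words of L."""
    return count_factorizations(x, L) == 1
```

For instance, over L = {a, ab, ba} the word aba factors both as a · ba and as ab · a, so this L is not a code.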

Although codes have been studied extensively (see, for example, [1]), in this
paper we look at some novel aspects of unique factorization.

2 Unique factorizations 

Given L , we define uf(L) to be the set of all elements of L* having unique 
factorization into elements of L. We recall the following familiar fact: 

Proposition 1. If L is regular, then so is uf(L).

Proof. If L contains the empty word $\epsilon$ then no elements of $L^*$ have unique
factorization, and so uf(L) = $\emptyset$. So, without loss of generality we can assume
$\epsilon \notin L$.

To prove the result, we show that the relative complement $L^* - \mathrm{uf}(L)$ is
regular. Let L be accepted by a DFA M. On input $x \in L^*$, we build an NFA
M' to guess two different factorizations of x and verify they are different. The
machine M' maintains the single state of the DFA M for L as it scans the
elements of x, until M' reaches a final state q. At this point M' moves, via
an $\epsilon$-transition, to a new kind of state that records pairs. Transitions on these
“doubled” states still follow M’s transition function in both coordinates, with
the exception that if either state is in F, we allow a “reset” implicitly to $q_0$.
Each implicit return to $q_0$ marks, in a factorization, the end of a term. The final
states of M' are the “doubled” states with both elements in F.

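More precisely, assume $M = (Q, \Sigma, \delta, q_0, F)$. Since $\epsilon \notin L(M)$, we know $q_0 \notin F$.
We create the machine $M' = (Q', \Sigma, \delta', q_0, F')$ as follows:

$$\delta'(q, a) = \begin{cases}
\{\delta(q, a)\}, & \text{if } q \notin F; \\
\{\delta(q_0, a),\ [\delta(q_0, a), \delta(q, a)]\}, & \text{if } q \in F.
\end{cases}$$

Writing $r = \delta(p, a)$, $s = \delta(q, a)$, $t = \delta(q_0, a)$, we also set

$$\delta'([p, q], a) = \begin{cases}
\{[r,s]\}, & \text{if } p \notin F,\ q \notin F; \\
\{[r,s], [t,s]\}, & \text{if } p \in F,\ q \notin F; \\
\{[r,s], [r,t]\}, & \text{if } p \notin F,\ q \in F; \\
\{[r,s], [t,s], [r,t], [t,t]\}, & \text{if } p \in F,\ q \in F.
\end{cases}$$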


Finally, we set $F' = F \times F$. To see that the construction works, suppose that
$x \in L^*$ has two different factorizations

$$x = y_1 y_2 \cdots y_j y_{j+1} \cdots y_k = y_1 y_2 \cdots y_j z_{j+1} \cdots z_\ell$$

with $y_{j+1}$ a proper prefix of $z_{j+1}$. Then an accepting path starts with singleton
sets until the end of $y_j$. The next transition goes to a pair having first element
$\delta(q_0, a)$ with a the first letter of $y_{j+1}$. Subsequent transitions eventually lead to
a pair in $F \times F$.

On the other hand, if x is accepted, then two different factorizations are
traced out by the accepting computation in each coordinate. The factorizations
are guaranteed to be different because of the transition to $[\delta(q_0, a), \delta(q, a)]$. □

Remark 2. There is a shorter and more transparent proof of this result, as follows.
Given a DFA for L, create an NFA A for $L^*$ by adding $\epsilon$-transitions from
every final state back to the initial state, and then removing the $\epsilon$-transitions
using the familiar method (e.g., [2, Theorem 2.2]). Next, using the Boolean matrix
interpretation of finite automata (e.g., [5] and 51 §3-8]), we can associate an
adjacency matrix $M_a$ with the transitions of A on the letter a. Then, on input
$x = a_1 a_2 \cdots a_i$, a DFA can compute the matrix $M_x = M_{a_1} M_{a_2} \cdots M_{a_i}$ using
ordinary integer matrix multiplication, with the proviso that any entry that is
2 or more is changed to 2 after each matrix multiplication. This can be done by
a DFA since the number of such matrices is at most $3^{n^2}$, where n is the number
of states of M. Then, accepting if and only if the entry in the row and column
corresponding to the initial state of A is 1, we get a DFA accepting exactly those
x having unique factorization into elements of L. While this proof is much simpler,
the state bound it provides is quite extravagant compared to our previous
proof.
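
The counting idea of Remark 2 can be sketched in code. The version below is a simplification under our own assumptions, not the construction of the remark verbatim: it tracks only one row of the capped matrix product (enough to decide a single word), it uses a fresh state `START` to stand for "a factor has just ended" instead of reusing the initial state, and it builds the DFA for a finite L as a trie. All names are hypothetical.

```python
def trie_dfa(L):
    """Partial DFA for a finite language L; states are the prefixes of words of L."""
    delta = {}
    for w in L:
        for i in range(len(w)):
            delta[(w[:i], w[i])] = w[:i + 1]
    return delta, "", set(L)


START = object()  # fresh state meaning "a factor has just been completed"


def capped_factorization_count(delta, q0, finals, x):
    """Return 0, 1 or 2: the number of factorizations of x, with 2 meaning 'two or more'."""
    counts = {START: 1}
    for a in x:
        nxt = {}
        for p, c in counts.items():
            q = delta.get((q0 if p is START else p, a))
            if q is None:
                continue  # missing transition: this path dies
            nxt[q] = min(2, nxt.get(q, 0) + c)  # stay inside the current factor
            if q in finals:
                nxt[START] = min(2, nxt.get(START, 0) + c)  # end the factor here
        counts = nxt
    return counts.get(START, 0)
```

Because every entry is capped at 2, only finitely many count vectors can arise, which is the reason the computation can be carried out by a finite automaton.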

Corollary 3. Suppose L is accepted by a DFA with n states. If L is not a code,
then there exists a word $x \in L^*$ with at least two distinct factorizations into
elements of L, with $|x| < n^2 + n$.

Proof. Our construction in the proof of Proposition 1 gives an NFA M' accepting
all words with at least two different factorizations, and it has $n^2 + n$ states. If
M' accepts anything at all, it accepts a word of length at most $n^2 + n - 1$. □

Proposition 4. For all $n \ge 2$, there exists an $O(n)$-state DFA accepting a
language L that is not a code, such that the shortest word in $L^*$ having two
factorizations into elements of L is of length $\Omega(n^2)$.

Proof. Consider the language $L_n = b(a^n)^* \cup (a^{n+1})^* b$. It is easy to see that
$L_n$ can be accepted by a DFA with $2n + 5$ states, but the shortest word in $L_n^*$
having two distinct factorizations into elements of $L_n$ is $b a^{n(n+1)} b$, of length
$n^2 + n + 2$. □
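
For small n the claim about the shortest ambiguous word can be checked by exhaustive search. The following sketch is our own illustration: membership in $L_n$ is tested directly from its definition, and factorizations are counted by dynamic programming. The function names are hypothetical.

```python
from itertools import product


def in_Ln(w, n):
    """Membership in L_n = b(a^n)* union (a^(n+1))*b."""
    if not w:
        return False
    first = w[0] == "b" and set(w[1:]) <= {"a"} and (len(w) - 1) % n == 0
    second = w[-1] == "b" and set(w[:-1]) <= {"a"} and (len(w) - 1) % (n + 1) == 0
    return first or second


def count_factorizations(x, member):
    """Number of factorizations of x into words accepted by the predicate `member`."""
    ways = [1] + [0] * len(x)
    for end in range(1, len(x) + 1):
        ways[end] = sum(ways[start] for start in range(end) if member(x[start:end]))
    return ways[-1]


def shortest_ambiguous(n, max_len):
    """First word over {a, b} of length <= max_len with at least two factorizations."""
    for length in range(1, max_len + 1):
        for letters in product("ab", repeat=length):
            x = "".join(letters)
            if count_factorizations(x, lambda w: in_Ln(w, n)) >= 2:
                return x
    return None
```

The search is exponential in `max_len`, so it is only meant for very small n.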

In fact, there are even examples of finite languages with the same property. 

Proposition 5. For all $n \ge 2$, there exists an $O(n)$-state DFA accepting a
finite language L that is not a code, such that the shortest word in $L^*$ having
two factorizations is of length $\Omega(n^2)$.

Proof. Let $\Sigma = \{b, a_1, a_2, \ldots, a_n\}$ be an alphabet of size $n + 1$, and let $L_n$ be
the language of $2n$ words

$$\{a_1, a_n\} \cup \{b^i a_{i+1} : 1 \le i < n\} \cup \{a_i b^i : 1 \le i < n\}$$

defined over $\Sigma$.

Then it is easy to see that $L_n$ can be accepted with a DFA of $2n + 2$ states,
while the shortest word having two distinct factorizations is

$$a_1 b a_2 b^2 a_3 b^3 \cdots a_{n-1} b^{n-1} a_n,$$

which is of length $n(n + 1)/2$. □
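
The finite example can also be checked mechanically for small n. In the sketch below (our own illustration, with hypothetical names) a word is a tuple of letter names 'b', 'a1', ..., 'an', so that the multi-character names of the letters $a_i$ cause no ambiguity.

```python
from itertools import product


def prop5_language(n):
    """The 2n words of L_n, each a tuple of letters 'b', 'a1', ..., 'an'."""
    L = {("a1",), (f"a{n}",)}
    for i in range(1, n):
        L.add(("b",) * i + (f"a{i + 1}",))  # b^i a_{i+1}
        L.add((f"a{i}",) + ("b",) * i)      # a_i b^i
    return L


def count_factorizations(x, L):
    """Number of factorizations of the tuple x into words (tuples) of L."""
    ways = [1] + [0] * len(x)
    for end in range(1, len(x) + 1):
        for w in L:
            if len(w) <= end and x[end - len(w):end] == w:
                ways[end] += ways[end - len(w)]
    return ways[-1]


def claimed_word(n):
    """The word a_1 b a_2 b^2 ... a_{n-1} b^{n-1} a_n."""
    x = ()
    for i in range(1, n):
        x += (f"a{i}",) + ("b",) * i
    return x + (f"a{n}",)


def shortest_ambiguous_length(n, max_len):
    """Length of a shortest word with two factorizations, searching up to max_len."""
    alphabet = ["b"] + [f"a{i}" for i in range(1, n + 1)]
    L = prop5_language(n)
    for length in range(1, max_len + 1):
        if any(count_factorizations(x, L) >= 2 for x in product(alphabet, repeat=length)):
            return length
    return None
```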

Remark 6. The previous example can be recoded over a three-letter alphabet by
mapping each $a_i$ to the base-2 representation of i, padded, if necessary, to make
it of length $\ell$, where $\ell = \lceil \log_2 n \rceil$. With some reasonably obvious reuse of states
this can still be accepted by a DFA using $O(n)$ states, and the shortest word
with two distinct factorizations is still of length $\Omega(n^2)$.

Theorem 7. If L is a CFL, then uf(L) need not be a CFL.

Proof. Let L = PALSTAR, the set of all strings over the alphabet $\Sigma = \{0, 1\}$ that
are the concatenation of one or more even-length palindromes. Clearly L is a
CFL. Then uf(L) = PRIMEPALSTAR, which was proven in [3] to be non-context-free.
(Here PRIMEPALSTAR is the set of all elements of PALSTAR that cannot be
written as the product of two or more elements of PALSTAR.) □

3 Semi-unique factorizations 

We now consider a variation on unique factorization. We say that $x \in L^*$ has
semi-unique factorization if all factorizations of x into elements of L consist of
the same number of factors. More precisely, x has semi-unique factorization if
whenever

$$x = y_1 y_2 \cdots y_m = z_1 z_2 \cdots z_n$$

for $y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_n \in L$, then $m = n$.

Given a language L, we define su(L) to be the set of all elements of $L^*$ having
semi-unique factorization over L.

Example 8. Let $L = \{a, ab, aab\}$. Then su(L) = $(ab)^* a^*$.
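
Example 8 can be confirmed for short words by computing, for each word, the set of possible numbers of factors. This is a sketch of our own; the function names are hypothetical and L is assumed to be a finite set of nonempty strings.

```python
import re
from itertools import product


def factorization_lengths(x, L):
    """Set of possible numbers of factors over all factorizations of x into words of L."""
    lengths = [set() for _ in range(len(x) + 1)]
    lengths[0] = {0}
    for end in range(1, len(x) + 1):
        for w in L:
            if len(w) <= end and x[end - len(w):end] == w:
                lengths[end] |= {k + 1 for k in lengths[end - len(w)]}
    return lengths[-1]


def in_su(x, L):
    """True when x is in L* and all of its factorizations have the same number of factors."""
    return len(factorization_lengths(x, L)) == 1


def example8_holds(max_len):
    """Compare su({a, ab, aab}) with (ab)*a* on all words over {a, b} up to max_len."""
    L = {"a", "ab", "aab"}
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            x = "".join(letters)
            if in_su(x, L) != bool(re.fullmatch(r"(ab)*a*", x)):
                return False
    return True
```

The word aab shows why the factor aab matters: it factors as aab (one term) and as a · ab (two terms), so it is excluded from su(L).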

Theorem 9. If L is regular, then su(L) is a CSL and a co-CFL.

Proof. To see that su(L) is a co-CFL, mimic the proof of Proposition 1. We use a
stack to keep track of the difference between the number of terms in the two
guessed factorizations, and another flag in the state to say which, the “top”
or the “bottom” state, has more terms (since the stack can’t hold negative
counters). We accept if we guess two factorizations having different numbers of
terms.

To see that su(L) is a CSL, note that su(L) is decidable in DSPACE(n). (All we
need to do is enumerate all the possible factorizations; since no factorization is
longer than the word itself, we can list them all in linear space.) □

Corollary 10. Given a regular language L, it is decidable if there exist elements
$x \in L^*$ lacking semi-unique factorization.

Proof. Given L, we can construct the PDA accepting $L^* - \mathrm{su}(L)$. We convert this
PDA to a CFG G generating the same language (e.g., [2, Theorem 5.4]). Finally,
we use well-known techniques (e.g., [2, Theorem 6.6]) to determine whether L(G)
is empty. □

Theorem 11. If L is regular then su(L) need not be a CFL.

Proof. Let

$$L = a0^+b + 1 + c(23)^+ + 23d + a + 0 + b1^+c(23)^+ + a0^+b1^+c2 + 32 + 3d.$$

Consider su(L) and intersect with the regular language $a0^+b1^+c(23)^+d$.
Then there are only three possible factorizations for a given word here. They
look like (using parens to indicate factors)

$(a0^ib)\,1 \cdot 1 \cdots 1\,(c(23)^k)(23d)$, which has $j + 3$ terms if j is the number of
1’s;

$(a)\,0 \cdot 0 \cdots 0\,(b1^jc(23)^k)(23d)$, which has $i + 3$ terms if i is the number of 0’s;

and

$(a0^ib1^jc2)(32)(32) \cdots (32)(3d)$, which has $k + 2$ terms, if k is the number of
(32)’s.

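So if all three factorizations have the same number of terms we must have
$i = j = k - 1$. Writing $n = i = j$, so that $k = n + 1$ and the word contains
$k + 1 = n + 2$ copies of 23, this gives us

$$\{a0^n b1^n c(23)^{n+2} d : n \ge 1\},$$

which is not a CFL. □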
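
The term counts in this proof can be checked by brute force. In the sketch below (our own illustration) membership in L is tested with a regular expression transcribing the ten summands of L, and K denotes the total number of copies of 23 in the word, so that K = k + 1 in the notation of the proof. For K at least 2 the three factorizations have j + 3, i + 3 and K + 1 terms. The names `term_counts` and `word` are hypothetical.

```python
import re

# The ten summands of L, written as one alternation.
L_PATTERN = re.compile(r"a0+b|1|c(?:23)+|23d|a|0|b1+c(?:23)+|a0+b1+c2|32|3d")


def term_counts(x):
    """Set of possible numbers of factors over all factorizations of x into words of L."""
    counts = [set() for _ in range(len(x) + 1)]
    counts[0] = {0}
    for end in range(1, len(x) + 1):
        for start in range(end):
            # fullmatch(string, pos, endpos) tests the slice x[start:end]
            if counts[start] and L_PATTERN.fullmatch(x, start, end):
                counts[end] |= {k + 1 for k in counts[start]}
    return counts[-1]


def word(i, j, K):
    """The word a 0^i b 1^j c (23)^K d."""
    return "a" + "0" * i + "b" + "1" * j + "c" + "23" * K + "d"
```

In particular the three counts coincide exactly when i = j = K - 2.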


There are even examples where L is finite. For expository purposes, we give
an example over the 21-letter alphabet

$$\Sigma = \{0, 1, 2, 3, 4, 5, 6, 7, 8, a, b, c, d, e, f, g, h, i, j, k, l\}.$$


Theorem 12. If L is finite, then su(L) need not be a CFL.

Proof. Define

$$L_1 = \{0ab, cd, ab, cd127, efgh, efgh3, 4ijkl, ijkl, 5, 68\}$$
$$L_2 = \{0abc, dabc, d1, 27e, fg, he, h34ij, klij, kl568\}$$
$$L_3 = \{0a, bcda, bcd12, 7ef, ghef, gh34i, jk, li, jkl56, 8\}$$

and set $L := L_1 \cup L_2 \cup L_3$.

Consider possible factorizations of words of the form

$$0(abcd)^m 127 (efgh)^n 34 (ijkl)^p 568$$

for some integers $m, n, p \ge 1$. Any factorization of such a word into elements of
L must begin with either 0ab, 0abc, or 0a. There are three cases to consider:


Case 1: the first word is 0ab. Then the next word must begin with c, and there
are only two possible choices: cd and cd127. If the next word is cd then since no
word begins with 1 the only choice is to pick a word starting with a, and there
is only one: ab. After picking this, we are back in the same situation, and can
only choose between cd followed by ab, or cd127. Once cd127 is picked we must
pick a word that begins with e. However, there are only two: efgh and efgh3.
If we pick efgh we are left in the same situation. Once we pick efgh3 we must
pick a word starting with 4, but there is only one: 4ijkl. After this we can either
pick 5 and then 68, or we can pick ijkl a number of times, followed by 5 and
then 68. This gives the factorization

$$(0ab)((cd)(ab))^{m-1}(cd127)(efgh)^{n-1}(efgh3)(4ijkl)(ijkl)^{p-1}(5)(68)$$

having $1 + 2(m - 1) + 1 + (n - 1) + 1 + 1 + (p - 1) + 1 + 1 = 2m + n + p + 2$
terms.

Case 2: the first word is 0abc. Then the next word must begin with d, and there
are only two choices: dabc and d1. If we pick dabc we are back in the same
situation. If we pick d1 then the next word must begin with 2, but there is only
one such word: 27e. Then the next word must begin with f, but there is only
one: fg. Then the next word must begin with h, but there are only two: he and
h34ij. If we pick he we are back in the same situation. Otherwise we must have
a word beginning with k, but there are only two: klij and kl568. This gives the
factorization

$$(0abc)(dabc)^{m-1}(d1)(27e)((fg)(he))^{n-1}(fg)(h34ij)(klij)^{p-1}(kl568)$$

having $1 + (m - 1) + 2 + 2(n - 1) + 1 + 1 + (p - 1) + 1 = m + 2n + p + 2$ terms.


Case 3: the first word is 0a. Then only bcda and bcd12 start with b, so we must
choose bcda over and over until we choose bcd12. Only one word starts with 7 so
we must choose 7ef. Now we must choose ghef again and again until we choose
gh34i. We now choose jk and li alternately until jkl56. Finally, we pick 8.

This gives us a factorization

$$(0a)(bcda)^{m-1}(bcd12)(7ef)(ghef)^{n-1}(gh34i)((jk)(li))^{p-1}(jkl56)(8)$$

with $1 + (m - 1) + 2 + (n - 1) + 1 + 2(p - 1) + 2 = m + n + 2p + 2$ terms.

So for all these three factorizations to have the same number of terms, we
must have

$$2m + n + p + 2 = m + 2n + p + 2 = m + n + 2p + 2.$$

Eliminating variables we get that $m = n = p$. So when we compute su(L) and
intersect with the regular language $0(abcd)^+ 127 (efgh)^+ 34 (ijkl)^+ 568$ we get

$$\{0(abcd)^n 127 (efgh)^n 34 (ijkl)^n 568 : n \ge 1\},$$

which is clearly a non-CFL. □
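
The three term counts derived in Cases 1 to 3 can be verified by brute force for small m, n, p. The sketch below is our own illustration; `term_counts` and `word` are hypothetical names, and the 29 words of L are transcribed from the definitions of $L_1$, $L_2$, $L_3$ (note the distinction between the digit 1 and the letter l).

```python
L = {
    # L_1
    "0ab", "cd", "ab", "cd127", "efgh", "efgh3", "4ijkl", "ijkl", "5", "68",
    # L_2
    "0abc", "dabc", "d1", "27e", "fg", "he", "h34ij", "klij", "kl568",
    # L_3
    "0a", "bcda", "bcd12", "7ef", "ghef", "gh34i", "jk", "li", "jkl56", "8",
}


def term_counts(x, L):
    """Set of possible numbers of factors over all factorizations of x into words of L."""
    counts = [set() for _ in range(len(x) + 1)]
    counts[0] = {0}
    for end in range(1, len(x) + 1):
        for w in L:
            if len(w) <= end and x[end - len(w):end] == w:
                counts[end] |= {k + 1 for k in counts[end - len(w)]}
    return counts[-1]


def word(m, n, p):
    """The word 0 (abcd)^m 127 (efgh)^n 34 (ijkl)^p 568."""
    return "0" + "abcd" * m + "127" + "efgh" * n + "34" + "ijkl" * p + "568"
```

A word of this shape then has semi-unique factorization exactly when the set returned by `term_counts` is a singleton, which happens when m = n = p.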


Remark 13. The previous two examples can be recoded over a binary alphabet,
by mapping the i’th letter to the string $b a^i b$.

4 Permutationally unique factorization

In this section we consider yet another variation on unique factorization, namely
factorizations that are unique up to permutations of the factors.

Formally, given a language L we say $x \in L^*$ has permutationally unique
factorization if whenever $x = y_1 y_2 \cdots y_m = z_1 z_2 \cdots z_n$ for

$$y_1, y_2, \ldots, y_m, z_1, z_2, \ldots, z_n \in L,$$

then $m = n$ and there exists a permutation $\sigma$ of $\{1, \ldots, n\}$ such that $y_i = z_{\sigma(i)}$
for $1 \le i \le n$. In other words, we consider two factorizations that differ only
in the order of the factors to be the same. We define ufp(L) to be the set of x
having permutationally unique factorization.

Example 14. Consider $L = \{a^3, a^4\}$. Then

$$\mathrm{ufp}(L) = \{a^3, a^4, a^6, a^7, a^8, a^9, a^{10}, a^{11}, a^{13}, a^{14}, a^{17}\}.$$
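
Over a one-letter alphabet, a factorization up to permutation is determined by how many times each word is used. So $a^n$ has permutationally unique factorization over $\{a^3, a^4\}$ exactly when $n = 3x + 4y$ has a single solution in nonnegative integers. The following sketch (our own, with hypothetical names) lists those n; it considers nonempty words only, as the example does.

```python
def multiset_factorizations(n, lengths=(3, 4)):
    """All pairs (x, y) of nonnegative integers with p*x + q*y == n, for lengths (p, q)."""
    p, q = lengths
    return [(x, (n - p * x) // q) for x in range(n // p + 1) if (n - p * x) % q == 0]


def in_ufp_unary(n, lengths=(3, 4)):
    """True when a^n has exactly one factorization up to permutation of the factors."""
    return len(multiset_factorizations(n, lengths)) == 1


exponents = [n for n in range(1, 60) if in_ufp_unary(n)]
```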

Theorem 15. If L is finite then ufp(L) is a CSL and a co-CFL.

Proof. The claim about CSL should be clear.

We sketch the construction of a PDA accepting $L^* - \mathrm{ufp}(L)$. If a word is in $L^*$
but has two permutationally distinct factorizations, then there has to be some
factor appearing in the factorizations a different number of times. Our PDA
nondeterministically guesses two different factorizations and a factor $t \in L$ that
appears a different number of times in the factorizations, then verifies the
factorizations and checks the number. It uses the stack to hold the absolute value
of the difference between the number of times t appears in the first factorization
and the second. It accepts if both factorizations end properly and the stack is
nonempty. □

Theorem 16. If L is finite then ufp(L) need not be a CFL. 

Proof. Let $\Sigma = \{a, b, c\}$. Define $L = \{A, B, S_1, S_2, T_1, T_2\} \subseteq \Sigma^+$ as follows:

$$A = aa,\quad B = aaa,\quad S_1 = ab,\quad S_2 = ac,\quad T_1 = ba,\quad T_2 = ca.$$

Let $R = aa(ab)^+(ac)^+aa(ba)^+(ca)^+aaa$, and consider words of the form

$$w := aa(ab)^r(ac)^s aa(ba)^t(ca)^q aaa \in \mathrm{ufp}(L) \cap R$$

with $r, s, t, q \ge 1$ and the following two factorizations of w:

$$A S_1^r S_2^s A T_1^t T_2^q B = aa \cdot (ab)^r \cdot (ac)^s \cdot aa \cdot (ba)^t \cdot (ca)^q \cdot aaa \tag{1}$$

$$B T_1^r T_2^s S_1^t S_2^q A A = aaa \cdot (ba)^r \cdot (ca)^s \cdot (ab)^t \cdot (ac)^q \cdot aa \cdot aa \tag{2}$$

It is not difficult to see that any factorization of w must be of one of these two
forms. Since w has prefix aaab, it must start with either $AS_1$ or $BT_1$. If it starts
with $AS_1 = aa \cdot ab$, the next factors must be $S_1^{r-1}$ to match $(ab)^r$, so we have
$AS_1^r$. We then see $(ac)^s$, which can only match with $S_2^s$. Next, we see ‘aaba’;
thus we must choose $AT_1 = aa \cdot ba$. We then have $(ba)^{t-1}$, which can only match
with $T_1^{t-1}$, and then $(ca)^q$, matching only with $T_2^q$. Finally the suffix is ‘aaa’,
which can only match with B as required.

If w starts with $BT_1 = aaa \cdot ba$, the next part is $(ba)^{r-1}$, which only matches
with $T_1^{r-1}$. Then we see $(ca)^s$, so we must use factors $T_2^s$. We then see $(ab)^t$ and
$(ac)^q$, matching with $S_1^t$ and $S_2^q$ respectively. Finally we have ‘aaaa’, matching
only with AA as required.

If $r = t$ and $s = q$, then the number of each factor $(A, B, S_1, S_2, T_1, T_2)$
in factorizations (1) and (2) is identical. Therefore, w always has more than
one factorization (of types (1) and (2)); however, these factorizations are only
non-permutationally equivalent if $r \ne t$ or $s \ne q$. Therefore

$$\mathrm{ufp}(L) \cap R = \{aa \cdot (ab)^r \cdot (ac)^s \cdot aa \cdot (ba)^t \cdot (ca)^q \cdot aaa \mid (r = t) \wedge (s = q)\}$$
$$= \{A S_1^r S_2^s A T_1^r T_2^s B : r, s \ge 1\},$$

which is not a context-free language. □
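
For small exponents, both the claim that w has exactly the two factorizations (1) and (2) and the resulting characterization can be checked by enumerating all factorizations. This is a sketch of our own with hypothetical names; the enumeration is exponential in general and is intended only for short words.

```python
from collections import Counter

L = ["aa", "aaa", "ab", "ac", "ba", "ca"]  # A, B, S_1, S_2, T_1, T_2


def factorizations(x, L):
    """Yield every factorization of x into words of L, as a tuple of factors."""
    if not x:
        yield ()
        return
    for w in L:
        if x.startswith(w):
            for rest in factorizations(x[len(w):], L):
                yield (w,) + rest


def in_ufp(x, L):
    """True when all factorizations of x use the same multiset of factors."""
    multisets = {frozenset(Counter(f).items()) for f in factorizations(x, L)}
    return len(multisets) == 1


def w(r, s, t, q):
    """The word aa (ab)^r (ac)^s aa (ba)^t (ca)^q aaa."""
    return "aa" + "ab" * r + "ac" * s + "aa" + "ba" * t + "ca" * q + "aaa"
```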

5 Subset-invariant factorization 

In this section we consider yet another variation on unique factorization. We say a
word $x \in L^*$ has subset-invariant factorization (into elements of L) if there exists
a subset $S \subseteq L$ with the property that every factorization of x into elements of
L uses exactly the elements of S (no more, no less), although each element
may be used a different number of times. More precisely, x has subset-invariant
factorization if there exists $S = S(x)$ such that whenever $x = y_1 y_2 \cdots y_m$ with
$y_1, y_2, \ldots, y_m \in L$, then $S = \{y_1, y_2, \ldots, y_m\}$. We let ufs(L) denote the set of
those $x \in L^*$ having such a factorization.
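
To make the definition concrete, the following sketch (our own illustration, for a finite L of nonempty strings, with hypothetical names) collects, for each factorization of x, the set of words it uses; x has subset-invariant factorization exactly when a single such set arises. The recursion is exponential in the worst case and is meant for short words only.

```python
def factor_sets(x, L):
    """For each factorization of x over L, the set of words it uses (as frozensets)."""
    if not x:
        return {frozenset()}
    result = set()
    for w in L:
        if x.startswith(w):
            result |= {s | {w} for s in factor_sets(x[len(w):], L)}
    return result


def in_ufs(x, L):
    """True when x is in L* and every factorization of x uses the same set of words."""
    return len(factor_sets(x, L)) == 1
```

For example, over L = {aa, aaa} the word $a^5$ has two factorizations, aa · aaa and aaa · aa, but both use the set {aa, aaa}, so $a^5$ is in ufs(L) although it is not in uf(L). By contrast $a^6$ = aa · aa · aa = aaa · aaa is not in ufs(L).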

Theorem 17. If L is finite then ufs(L) is regular. 

Proof. The proof is similar to the proof of Theorem 15 above. On input x we
nondeterministically attempt to construct two different factorizations into
elements of L, recording which elements of L we have seen so far. We accept if we
are successful in constructing two different factorizations (which will be different
if and only if some element was chosen in one factorization but not the other).
This NFA accepts $L^* - \mathrm{ufs}(L)$. So if L is finite, it follows that ufs(L) is regular.

In more detail, here is the construction. States of our NFA are 6-tuples of
the form $[w_1, s_1, v_1, w_2, s_2, v_2]$ where $w_1, w_2$ are the words of L we are currently
trying to match; $s_1, s_2$ are, respectively, the suffixes of $w_1, w_2$ we have yet to
see, and $v_1, v_2$ are binary characteristic vectors of length $|L|$, specifying which
elements of L have been seen in the factorization so far (including $w_1$ and $w_2$,
although technically they may not have been seen yet). Letting $C(z)$ denote the
vector with all 0’s except a 1 in the position corresponding to the word $z \in L$,
the initial states are $[w, w, C(w), x, x, C(x)]$ for all words $w, x \in L$. The final
states are of the form $[w, \epsilon, v_1, x, \epsilon, v_2]$ where $v_1 \ne v_2$. Transitions on a letter a
look like $\delta([w_1, as_1, v_1, w_2, as_2, v_2], a) = [w_1, s_1, v_1, w_2, s_2, v_2]$. In addition there
are $\epsilon$-transitions that update the corresponding vectors if $s_1$ or $s_2$ equals $\epsilon$, and
that “reload” the new $w_1$ and $w_2$ we are expecting to see:

$$\delta([w_1, \epsilon, v_1, w_2, s_2, v_2], \epsilon) = \{[w, w, v_1 \vee C(w), w_2, s_2, v_2] : w \in L\}$$
$$\delta([w_1, s_1, v_1, w_2, \epsilon, v_2], \epsilon) = \{[w_1, s_1, v_1, w, w, v_2 \vee C(w)] : w \in L\}.$$


□ 

The preceding proof also shows that the shortest word failing to have subset- 
invariant factorization is bounded polynomially: 

Corollary 18. Suppose $|L| = n$ and the length of the longest word of L is m.
Then if some word of $L^*$ fails to have subset-invariant factorization, there is a
word with this property of length $\le 2m^2n^2$.

Proof. Let $u \in L^+$ be a minimal length word such that $u \in L^+ - \mathrm{ufs}(L)$. Consider
the states of the NFA traversed in processing u. Let $S_0 := [w, w, C(w), x, x, C(x)]$
be the initial state and $S_F := [w_F, \epsilon, v_F, x_F, \epsilon, v'_F]$ the final state, where $v_F \ne v'_F$.
By definition, there must exist some $z \in L$ such that $v_F$ and $v'_F$ differ on $C(z)$,
i.e., $v_F \cdot C(z) + v'_F \cdot C(z) = 1$.

Initially the characteristic vectors have a single 1, and once an element is set
to 1 in a characteristic vector in the NFA, it is never reset to 0. Thus, there
exists some $1 \le k \le |u|$ such that $u = u_1 \cdots u_{k-1} \cdot u_k \cdot u_{k+1} \cdots u_{|u|}$ where
$S_{k-1} = \delta(S_0, u_1 \cdots u_{k-1})$ has a 0 in the characteristic vectors at position z, and
$\delta(S_{k-1}, u_k)$ has a 1 in exactly one of the two characteristic vectors at position
z. We shall now prove that $|u_1 \cdots u_{k-1}|, |u_{k+1} \cdots u_{|u|}| \le m^2n^2$, which proves the
result.

We prove the result for the word $v = u_1 \cdots u_{k-1}$; a similar analysis holds
for $u_{k+1} \cdots u_{|u|}$. Let $S_0, S_1, \ldots, S_{k-1}$ be the states of the NFA visited as we
process v. We prove that there does not exist $0 \le i < j \le k - 1$ such that $S_i =
[w_1, s_1, v_1, w_2, s_2, v_2]$ and $S_j = [w_1, s_1, v'_1, w_2, s_2, v'_2]$. We proceed by
contradiction. Assume such an i and j exist. Then $u_{i+1} \cdots u_j$ is such that
$\delta(S_i, u_{i+1} \cdots u_j) = S_j$. However, $\delta(S_i, u_{j+1} \cdots u_k)$ and $\delta(S_j, u_{j+1} \cdots u_k)$ can only
differ in their binary characteristic vectors, since the transition function does not
depend upon the characteristic vectors when we update the words $w_1, s_1, w_2, s_2$.
Thus, we can remove the factor $u_{i+1} \cdots u_j$ from u and still reach a final state of
the form $S_{F_2} = [w_F, \epsilon, v_{F_2}, x_F, \epsilon, v'_{F_2}]$, for which we still have that $v_{F_2} \ne v'_{F_2}$,
since they differ on element z due to letter $u_k$. Continuing this idea iteratively,
the maximal number of states k is bounded by $m^2n^2$. Doubling this bound gives
the result.

□ 


The next result shows that we can achieve a quadratic lower bound. 

Proposition 19. There exist examples with $|L| = 2n$ and longest word of length
n for which the shortest word of $L^*$ failing to have subset-invariant factorization
is of length $n(n + 1)/2$.

Proof. We just use the example of Proposition 5. □

Theorem 20. If L is regular then ufs(L) need not be a CFL.

Proof. We use a variation of the construction in the proof of Theorem 16. Let
$L = (ab)^+(ac)^+aa + (ba)^+(ca)^+ + aa + aaa$. Then (using the notation in the
proof of Theorem 16), if

$$w := aa(ab)^r(ac)^s aa(ba)^t(ca)^q aaa \in \mathrm{ufs}(L) \cap R$$

with $r, s, t, q \ge 1$ then there are two different factorizations of w:

$$\begin{aligned}
w &= aa \cdot (ab)^r(ac)^s aa \cdot (ba)^t(ca)^q \cdot aaa \\
  &= aaa \cdot (ba)^r(ca)^s \cdot (ab)^t(ac)^q aa \cdot aa
\end{aligned}$$

which are subset-invariant if and only if $r = t$ and $s = q$. So

$$\mathrm{ufs}(L) \cap R = \{aa(ab)^r(ac)^s aa(ba)^r(ca)^s aaa : r, s \ge 1\},$$

which is not a CFL. □